Predicting clinical outcomes of SARS-CoV-2 infection during the Omicron wave using machine learning

The Omicron SARS-CoV-2 variant continues to strain healthcare systems. Developing tools that facilitate the identification of patients at highest risk of adverse outcomes is a priority. The study objectives are to develop population-scale predictive models that: 1) identify predictors of adverse outcomes with Omicron surge SARS-CoV-2 infections, and 2) predict the impact of prioritized vaccination of high-risk groups on these outcomes. We conducted a retrospective longitudinal observational study of a national cohort of 172,814 patients in the U.S. Veteran Health Administration who tested positive for SARS-CoV-2 from January 15 to August 15, 2022. We utilized sociodemographic characteristics, comorbidities, and vaccination status at the time of testing positive for SARS-CoV-2 to predict hospitalization, escalation of care (high-flow oxygen, mechanical ventilation, vasopressor use, dialysis, or extracorporeal membrane oxygenation), and death within 30 days. Machine learning models demonstrated that advanced age, high comorbidity burden, lower body mass index, unvaccinated status, and oral anticoagulant use were important predictors of hospitalization and escalation of care. Similar factors predicted death; however, anticoagulant use did not predict mortality risk. The all-cause death model showed the highest discrimination (Area Under the Curve (AUC) = 0.903, 95% Confidence Interval (CI): 0.895, 0.911), followed by hospitalization (AUC = 0.822, CI: 0.818, 0.826), then escalation of care (AUC = 0.793, CI: 0.784, 0.805). Assuming a vaccine efficacy range of 70.8 to 78.7%, our simulations projected that targeted prevention in the highest risk group may have reduced 30-day hospitalization and death in more than 2 of 5 unvaccinated patients.


Introduction
The World Health Organization (WHO) estimates that the COVID-19 pandemic has resulted in over 521 million infections and 6.2 million deaths globally [1]. High mutation rates and the relatively rapid emergence of SARS-CoV-2 variants led to multiple surges that have strained healthcare systems worldwide. The Omicron (B.1.1.529) variant became the predominant cause of SARS-CoV-2 infections in the U.S. by January 2022 [2,3], after identification in South Africa in November 2021 [4,5]. Although Omicron variants and sub-variants have been linked to lower rates of hospitalization and death [3,6,7,8], Omicron-driven surges continued to challenge healthcare systems due to higher infectivity, partial vaccine escape, and antibody resistance [3,7].
Predictive modeling during the pandemic has provided crucial insight into clinical outcomes with COVID-19 infections; however, to date, these risk prediction tools have largely not included data for Omicron variants and have inconsistently incorporated important clinical factors such as vaccination status [9][10][11][12]. In this study, we first applied machine learning (ML) models to identify baseline patient characteristics that predict risk for hospitalization, escalation of care, and mortality among SARS-CoV-2 positive US Veterans during a recent seven-month observation period (January 15-August 15, 2022) when Omicron variants predominated. Our models incorporated previously under-utilized factors including vaccination status. Then, we extended our models to quantify the predicted impact of a mitigating strategy, such as prioritized vaccination of high-risk groups, on reducing the short-term risk of hospitalization, escalation of care, and death during the observation period. To do this, we utilized a well-characterized cohort of U.S. Veterans with SARS-CoV-2 infection in a national Veteran Health Administration (VHA) database.

Study cohort
Our study cohort consisted of all 172,814 Veterans who first tested positive for COVID-19 between January 15 and August 15, 2022, as captured by the VHA's COVID-19 Shared Data Resource with data curation within the VHA's Corporate Data Warehouse (CDW). No new data were collected, and no direct patient (or participant) contact took place. Patients' curated electronic health records in the VHA's CDW were analyzed behind the VHA secured firewall as part of the VHA research data initiative, Leveraging Electronic Health Information to Advance Precision medicine (LEAP, CSP#2012), which has been approved by VHA's Central Institutional Review Board (CIRB) and Research & Development Committees at 3 VA Medical Centers (Salt Lake City, Palo Alto, and West Haven). The VHA's CIRB approved a waiver of the requirement to obtain informed consent. The date of the first positive test is defined as the index date. For the selected cohort within the data resource, there were no missing data for the selected fields, and unknown covariates were indicated as such. Patients outside the age range of 18 to 100, outside the Body Mass Index (BMI) range of 15 to 100, or who experienced reinfection during the seven-month observation period were excluded from the analysis.

Study outcomes
We predicted the risk of developing one of the following three distinct, non-mutually exclusive clinical outcomes representing SARS-CoV-2 severity within 30 days of infection: (i) hospitalization, (ii) escalation of care (defined as the need for high-flow supplemental oxygen, mechanical ventilation, vasopressors, renal replacement therapy [with no prior dialysis in the preceding two years], or extracorporeal membrane oxygenation [ECMO]), and (iii) all-cause mortality. Patients who tested positive for SARS-CoV-2 were deemed to have 'mild' infection if they did not experience any of the three outcomes of interest within 30 days of infection. The UpSet plot was generated using the UpSetR package [13].

Clinical features
A total of 159 patient characteristics including medical comorbidities, demographic data, vaccination status, and comorbidity indices were available for each patient prior to feature selection. The medical history included pre-existing conditions, procedures, and medications. All medical history values were classified using a Boolean system for presence or absence of the specific medical condition within two years prior to the current COVID-19 infection. Demographic and clinical data employed in the modeling included age, sex, race/ethnicity, blood type, BMI, veteran status, whether overweight at index date, rurality of current residence, and veteran priority status (a surrogate for income status and benefits eligibility). These covariates were multimodal (float, categorical, and Boolean). Vaccination status was represented as a categorical score from 0 to 5 as follows: 0 = no vaccination, 1 = partial mRNA vaccination, 2 = full vaccination (two doses of mRNA or a single dose of viral vector-based vaccine) >5 months from index date, 3 = fully vaccinated and boosted >5 months prior to the index date, 4 = fully vaccinated <5 months prior to the index date, 5 = fully vaccinated and boosted <5 months prior to the index date. Vaccines given outside of the VHA were available in the VHA COVID-19 Shared Data Resource and reflected in our dataset. Vaccination status accounted for a two-week efficacy window. Medical comorbidity burden was assessed by Charlson Comorbidity Index (CCI) [14] and Elixhauser Index [15] scores for the two years prior to infection. An overall CCI and Elixhauser index score was also determined. A complete list of covariates is included in S1 Table.
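The vaccination status scale above can be sketched as a small encoding function. This is an illustration only, not the study's code: the function name, its arguments, the use of time since the most recent dose, and the treatment of exactly 5 months as the older category are all assumptions, and the two-week efficacy window is not modeled.

```python
def vaccination_score(series, boosted=False, months_since_last_dose=None):
    """Map a vaccination history to the 0-5 score described in the text.

    series: "none", "partial", or "full" (hypothetical labels).
    months_since_last_dose: months between the most recent dose and the
    index date (assumption: the 5-month cut-off applies to this interval).
    """
    if series == "none":
        return 0
    if series == "partial":
        return 1
    if series != "full":
        raise ValueError(f"unknown series: {series!r}")
    if months_since_last_dose is None:
        raise ValueError("months_since_last_dose is required for a full series")
    recent = months_since_last_dose < 5  # assumption: exactly 5 months counts as not recent
    if boosted:
        return 5 if recent else 3
    return 4 if recent else 2


# Example: fully vaccinated and boosted two months before the index date
example_score = vaccination_score("full", boosted=True, months_since_last_dose=2)
```

Note that the score is categorical rather than strictly ordinal in time: scores 4 and 5 denote recent vaccination, while 2 and 3 denote vaccination more than 5 months before the index date.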

Model development and performance
For each of the 3 main outcomes of interest, we developed a distinct binary model that incorporated 159 unique covariate features using gradient boosting automated machine learning methods. A recursive feature elimination approach was used to find the most parsimonious models. Our data were split chronologically, with training/validation data from January 15, 2022 to April 15, 2022 and test data from April 16, 2022 to August 15, 2022. Covariates with variance lower than 1% within the training set were removed, and non-binary values were scaled from 0 to 1.
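The preprocessing steps in this paragraph (chronological split, low-variance filtering, and scaling to the 0 to 1 range) can be sketched as follows. This is a minimal standard-library illustration on made-up rows, not the study pipeline: the row layout, the function names, the interpretation of "1%" as a variance threshold of 0.01, and fitting the scaling bounds on the training set only are assumptions.

```python
from datetime import date
from statistics import pvariance

TRAIN_END = date(2022, 4, 15)  # last index date of the training/validation period


def chronological_split(rows):
    """Split patient rows by index date rather than at random."""
    train = [r for r in rows if r["index_date"] <= TRAIN_END]
    test = [r for r in rows if r["index_date"] > TRAIN_END]
    return train, test


def low_variance_features(train, features, threshold=0.01):
    """Return the features whose training-set variance is below the threshold."""
    return [f for f in features if pvariance([r[f] for r in train]) < threshold]


def fit_min_max(train, feature):
    """Learn scaling bounds from the training set only (assumption)."""
    values = [r[feature] for r in train]
    return min(values), max(values)


def scale(value, lo, hi):
    """Min-max scale a value; a constant feature maps to 0.0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


# Made-up rows for illustration only
rows = [
    {"index_date": date(2022, 1, 20), "age": 40.0, "rare_flag": 0},
    {"index_date": date(2022, 3, 1), "age": 80.0, "rare_flag": 0},
    {"index_date": date(2022, 4, 15), "age": 60.0, "rare_flag": 0},
    {"index_date": date(2022, 4, 16), "age": 70.0, "rare_flag": 1},
]
train, test = chronological_split(rows)
dropped = low_variance_features(train, ["age", "rare_flag"])
age_lo, age_hi = fit_min_max(train, "age")
scaled_test_age = scale(test[0]["age"], age_lo, age_hi)
```

With bounds learned from the training period, a test-period value outside the training range would fall outside the 0 to 1 interval; whether to clip such values is a separate design choice that the text does not specify.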
Model training and optimization were performed on the training and validation sets. The H2O AI package for automated machine learning was used to train each model, and the validation set was used for benchmarking the optimization process [16]. An initial heuristic search through available modeling methods using this package identified gradient boosting machines as the highest performers (S2 Table). Stacked models were not considered due to their poor interpretability-to-performance tradeoff. All subsequent modeling was done using gradient boosting machines. The classes in this study were imbalanced toward patients without a severity outcome; we addressed this by oversampling the minority class (patients with a severity outcome) during model training to allow for higher predictive performance. The binary threshold for each model was the threshold with the maximum geometric mean of sensitivity and specificity on the test set. The 95% confidence intervals for the performance metrics were determined using the stat_util Python package and its bootstrapping method with 100 iterations [17].
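The threshold rule described above, choosing the cut-off that maximizes the geometric mean of sensitivity and specificity, can be sketched in a few lines. This is a standard-library illustration on made-up labels and scores; the function names are hypothetical, and restricting candidate thresholds to the observed scores is an assumption.

```python
import math


def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def best_gmean_threshold(y_true, scores):
    """Scan observed scores as candidate thresholds and keep the best G-mean."""
    best_t, best_g = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(y_true, scores, t)
        g = math.sqrt(sens * spec)
        if g > best_g:  # ties keep the lowest threshold
            best_t, best_g = t, g
    return best_t, best_g


# Made-up labels and predicted probabilities for illustration only
y_true = [0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9]
threshold, gmean = best_gmean_threshold(y_true, scores)
```

In this toy example the selected threshold is 0.7, where sensitivity is 3/4 and specificity is 1. The geometric mean penalizes a threshold that does well on one class only, which is why it suits imbalanced outcomes.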
All reported performance metrics were generated on the set-aside test set. Receiver operator characteristic (ROC) and precision recall curves and their respective area under curve (AUC) were calculated using the scikit-learn metrics package [18]. The precision recall curves were normalized by using sample weights.
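To make the reported quantities concrete, the sketch below computes a ROC AUC from its rank interpretation and a bootstrap confidence interval with 100 resamples. It is a dependency-free illustration, not the scikit-learn or stat_util code used in the study; the percentile method, the redrawing of single-class resamples, and all function names are assumptions.

```python
import random


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # redraw resamples that contain a single class
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Made-up labels and predicted probabilities for illustration only
y_true = [0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9]
auc = roc_auc(y_true, scores)
ci_lower, ci_upper = bootstrap_auc_ci(y_true, scores)
```

The pairwise AUC is quadratic in the number of patients and is shown only for clarity; a sorted-rank implementation would be needed at cohort scale.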

Model interpretation and applications
Feature importance values were extracted from the H2O generated models [16]. Relative importance is calculated as the decrease in mean squared error weighted by the number of samples passing through a given node for all trees. The percentage reported here is the fraction of a given feature against all other feature relative importance values.
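The percentage described above is a simple normalization of each feature's relative importance by the total across features. A minimal sketch, with made-up importance values and a hypothetical function name:

```python
def importance_percentages(relative_importance):
    """Convert raw relative importances to percentages of the total."""
    total = sum(relative_importance.values())
    if total <= 0:
        raise ValueError("total relative importance must be positive")
    return {name: 100.0 * value / total for name, value in relative_importance.items()}


# Made-up relative importance values for illustration only
raw = {"age": 50.0, "cci": 30.0, "bmi": 20.0}
percentages = importance_percentages(raw)
```

Because the values are normalized within a model, percentages are comparable across features of one model but not directly across the three outcome models.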
Shapley Additive exPlanations (SHAP) values were generated on the test set using the SHAP Python package and a tree-based explainer [19]. SHAP values were calculated on a random sample of 1,000 patients from the test set. Summary plots were generated by plotting the SHAP values in a bee swarm fashion.
For simulating the impact of targeted vaccinations, we selected the unvaccinated subset of our cohort from our test set. For each strategy scenario, we projected the potential reduction in outcomes if the patients were fully vaccinated (a score of 4 on our vaccination status scale). The projection required two steps. The first was to project how many symptomatic infections, and thus outcomes, would be prevented. To accomplish this, we randomly sampled and removed patients from our target group based on a published vaccine efficacy 95% CI range of 0.708 to 0.787, from which we sampled uniformly [20]. The second was to project, for the remaining patients in our target group, whether being fully vaccinated would have prevented the outcome. For this, we used our model and determined whether their predicted outcome changed when we altered the vaccination status score from 0 to 4. We then summed the remaining outcomes in our target group to determine the reduction. The 95% confidence intervals for the projections were determined using the stat_util Python package and its bootstrapping method with 100 iterations [17].
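The two-step projection can be sketched as follows. This is one possible reading of the procedure, not the study's implementation: the patient dictionaries, the `predict` callable, the function name, and the rule that an outcome persists only when the model still predicts it at a vaccination score of 4 are assumptions made for illustration.

```python
import random

EFFICACY_RANGE = (0.708, 0.787)  # published 95% CI quoted in the text [20]
FULLY_VACCINATED = 4  # vaccination status score used for the counterfactual


def project_remaining_outcomes(patients, target_ids, predict, seed=0):
    """Count outcomes remaining among unvaccinated patients after a strategy.

    patients: list of dicts with "id", "outcome" (bool) and model features.
    target_ids: ids of the patients selected for vaccination.
    predict: hypothetical callable returning True if the outcome is predicted.
    """
    rng = random.Random(seed)
    targets = [p for p in patients if p["id"] in target_ids]

    # Step 1: draw an efficacy uniformly and remove that share of the target
    # group, treating their infections (and outcomes) as prevented.
    efficacy = rng.uniform(*EFFICACY_RANGE)
    n_prevented = round(efficacy * len(targets))
    prevented = {p["id"] for p in rng.sample(targets, n_prevented)}

    remaining = 0
    for p in patients:
        if not p["outcome"]:
            continue
        if p["id"] not in target_ids:
            remaining += 1  # not targeted: outcome unchanged
        elif p["id"] in prevented:
            continue  # infection prevented in step 1
        elif predict({**p, "vax_score": FULLY_VACCINATED}):
            remaining += 1  # Step 2: outcome still predicted when vaccinated
    return remaining


# Made-up cohort and toy model for illustration only
patients = [
    {"id": i, "outcome": i % 2 == 0, "risk": 0.9 if i % 2 == 0 else 0.1, "vax_score": 0}
    for i in range(10)
]


def toy_predict(p):
    return p["risk"] - 0.15 * p["vax_score"] > 0.5


baseline = project_remaining_outcomes(patients, set(), toy_predict)
vaccinate_all = project_remaining_outcomes(patients, set(range(10)), toy_predict)
```

Repeating the call with different seeds and summarizing the results would give a spread for the projection, in the spirit of the bootstrapped intervals described above.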

Patient population and clinical predictors of COVID-19 infection severity
In a national VHA cohort of 172,814 Veterans who first tested positive for SARS-CoV-2 during a period in which the Omicron variant predominated (January 15-August 15, 2022), the median age was 62 years and 84% were men (Table 1). The racial/ethnic composition of the cohort was typical for a US Veteran population; 65.5% of the patients were white, 19% were black, and 9.4% were Hispanic. Asian, Native Hawaiian or Pacific Islander, and American Indian or Alaskan Native Veterans each represented approximately 1% of the cohort (Table 1).
Table 1. Baseline characteristics of study cohort of U.S. Veterans who tested positive for SARS-CoV-2.
Overall, 89.5% of Veterans had mild SARS-CoV-2 infections. Among Veterans who tested positive for SARS-CoV-2, 9.2% required hospitalization, 2.2% needed escalation of care, and 1.5% died (Table 1 and Fig 1). In the subset of hospitalized infected patients, a higher percentage required escalation of care (18%) and died (7%) compared to the overall cohort (Fig 1). Patients who died or required hospitalization and/or escalation of care were older and more likely to be male. Conversely, patients who had mild infections had a higher body mass index (BMI) than those who did not (Table 1). A higher percentage of patients who died were white, compared to the overall cohort (78.1% vs. 65.5%). In contrast, a lower percentage of patients who died were black, compared to those in the overall cohort (11.3% vs. 19%) (Table 1).
Patients with non-mild infections had a higher prevalence of diabetes, congestive heart failure, cerebrovascular disease, chronic kidney disease, and cirrhosis. Dementia was also more prevalent among patients who required hospitalization, required escalation of care, or died within 30 days after testing positive. While chronic lung disease was also more prevalent, the prevalence of asthma and bronchitis diagnoses in the 2 years prior to infection was similar among mild and non-mild infections.
Our study included detailed vaccination data (Table 1). Over 29.1% of the overall cohort were unvaccinated (neither partially nor fully vaccinated). Moreover, unvaccinated Veterans accounted for a disproportionately greater percentage of deaths (41.4%) compared to fully vaccinated and recently boosted (<5 months) Veterans, who accounted for only 14.7% of the overall cohort and 11.8% of deaths. The more advanced the patients' vaccination status, the lower their contribution to deaths (Table 1). Similar trends were observed by vaccination status for the patient groups who required hospitalization or escalation of care (Table 1).

Model performance
After recursive feature selection evaluated the importance of 159 covariates, hospitalization had 25 relevant covariates, escalation of care had 75 relevant covariates, and mortality had 25 relevant covariates. The binary ML models predicted all 3 outcomes with good discrimination; all models had thresholds that maximized balance in performance, with sensitivity, specificity, and precision greater than 72% (Table 2). Consistent with its deterministic nature, death was predicted with better discrimination than the other outcomes, based on AUCs for both the receiver operator characteristic

Model interpretation
We evaluated the covariates that most predicted risks of hospitalization, escalation of care, and mortality within 30 days of a SARS-CoV-2 positive test during the observation period. Feature importance was measured as the fraction of total error reduction for a given covariate ( Fully vaccinated and boosted patients had lower predicted risks of hospitalization, escalation of care, and death at 30 days. Additionally, unknown blood type and alternative insurance were among the most significant predictors of a lower risk for hospitalization, while residing in non-rural areas and being male were among the most important predictors of mortality risk (

Projected impact of risk-prioritized vaccination strategies
To project the impact of targeted vaccination on adverse outcomes using the prediction models, we examined the unvaccinated subset (n = 22,082) from the test cohort (n = 92,080). We projected the number of adverse outcomes for three in silico scenarios: (1) vaccination of all Veterans within the unvaccinated subset, (2) random vaccination of 20% of the unvaccinated Veterans, and (3) vaccination of only the Veterans in the top quintile of predicted risk for adverse outcomes (Table 3). Using sensitivity tradeoff curves (S2 Fig), we observed a step-up of predicted risk at the top quintile. Therefore, we selected the cut-off to be the top quintile of the population. In turn, our modeling projected the optimum impact of the risk-prioritized vaccination strategy. Full vaccination of the entire unvaccinated population in our test set was predicted to reduce hospitalizations by 82.1% (from 1,698 to 304), escalations of care by 82.9% (from 351 to 60), and deaths by 84.4% (from 179 to 28.1). When a random 20% of the unvaccinated population was vaccinated in the projection modeling, hospitalizations were reduced from 1,698 to 1,504 (11.4% reduction), escalations of care from 351 to 313 (10.8%), and deaths from 179 to 161 (10.1%). When vaccinating the patients in the top quintile (20%) of predicted risk for adverse outcomes, hospitalizations were reduced from 1,698 to 1,017 (40.1%), escalations of care from 351 to 233 (33.6%), and deaths from 179 to 101 (43.6%).
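The percentage reductions quoted above follow directly from the projected counts, and the ratio between the risk-prioritized and random strategies can be checked the same way. The short sketch below only redoes that arithmetic with the counts reported in this section; the function name is hypothetical.

```python
def percent_reduction(before, after):
    """Percentage reduction from a baseline count to a projected count."""
    return 100.0 * (before - after) / before


# Hospitalizations: baseline 1,698; random 20% -> 1,504; top quintile -> 1,017
hosp_random = percent_reduction(1698, 1504)
hosp_targeted = percent_reduction(1698, 1017)
hosp_fold = hosp_targeted / hosp_random

# Deaths: baseline 179; random 20% -> 161; top quintile -> 101
death_random = percent_reduction(179, 161)
death_targeted = percent_reduction(179, 101)
death_fold = death_targeted / death_random
```

For both hospitalization and death, the targeted strategy's reduction is more than 3.5 times that of vaccinating a random 20% of the unvaccinated subset.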

Discussion
In a national cohort of 172,814 US Veterans who tested positive for SARS-CoV-2 during the Omicron surge, we demonstrated the most robust prediction discrimination to date for 30-day risk for hospitalization, escalation of care, and mortality after COVID-19 infection, using ML methods. Our ML models leveraged data including detailed vaccination status during the Omicron surge. We identified predictors for, and projected subgroups of, high-risk individuals who stand to benefit the most from advancing vaccination status. Prioritizing vaccination of individuals in the top quintile of predicted risk for hospitalization or death was projected to produce reductions in hospitalization and death more than 3.5-fold greater than those from randomly vaccinating 20% of the population.
Previous prediction models, including those developed in the VHA, utilized data collected prior to the emergence of the Omicron SARS-CoV-2 variant [9][10][11][12]. A large retrospective analysis of over 1.5 million vaccinated patients in the VHA showed relatively low rates of breakthrough infections and related complications such as pneumonia and death [21]. This statistically powerful investigation excluded unvaccinated individuals and anyone with a prior history of COVID-19 infection, and risk prediction modeling was not a primary focus of that report. Although a prior smaller study incorporated vaccination into ML risk prediction modeling for COVID-19 [22], our study incorporated stratified vaccination status, which reflects degree of protection through number and timing of primary and booster vaccines in an ML-driven risk prediction model.
Compared to recent studies, ML models in the present study demonstrated more robust discrimination by AUC in predicting 30-day risk for hospitalization (AUC 0.822), escalation of care (AUC 0.793), and mortality (AUC 0.903) with COVID-19 infection. Two prior studies derived from cohorts of ~4,500 patients each demonstrated lower AUCs (0.804 and 0.813) for predicting hospitalization [23,24]. A previous model developed from a large VHA cohort of 7,635,064 (both infected and non-infected) with an observation window from May 21 to November 2, 2020 predicted 30-day mortality with a validation AUC of 0.836 (95% CI, 82.0%-85.3%) [9]. In addition, a recent study of 1,201 patients who contracted SARS-CoV-2 in Spain in 2020 predicted 30-day mortality with an AUC of 0.872 [25]. Commonly identified covariates in prior studies, advanced age and higher medical comorbidity indices, were associated with higher risks for the adverse outcomes of interest in our models [9][10][11]. Our models identified a general inverse association between BMI and predicted risk for adverse outcomes. This contrasts with a prior meta-analysis that demonstrated that higher BMI (and visceral adiposity) correlates with a higher risk of hospitalization, mortality, and other adverse outcomes such as admission to ICU and need for mechanical ventilation [26].
Consistent with prior vaccine trials [27], our study indicated that vaccination reduces hospitalizations, escalation of care, and deaths. Individuals who were fully vaccinated and boosted within 5 months from testing SARS-CoV-2 positive had the greatest projected protection. Use of oral anticoagulants in the two years prior to current infection strongly predicted 30-day hospitalization and escalation of care. The biological basis of this observation may be related to the underlying medical conditions that warranted anticoagulation or to specific effects of the anticoagulants themselves. Notably, baseline furosemide use was associated with a higher risk of death, suggesting that underlying heart failure or volume-expanded states are important determinants of infection severity in Omicron infections.

Limitations
The present findings in this national study of US Veterans may not be broadly applicable to the general population. Consistent with the US Veteran population, our study cohort was predominantly male and white with greater medical comorbidity and lower socioeconomic status than the general US population. The relevance of the models remains limited for racial/ethnic minority communities who have borne a disproportionate burden during the pandemic. However, the methodology used here can be applied and adapted to other populations or health care systems. Additionally, while some recent work has sought to remove confounding effects from machine learning models in imaging [28,29], these statistical approaches can lead to biases in estimating predictive modeling performances [30,31]. While statistical analysis is best suited for estimating the causality of features on outcome, here we sought to optimize robust predictive performance through machine learning and highlight predictive features.
For vaccine projections, all outcomes of interest were assumed to be the result of SARS-CoV-2 infection. While the VHA COVID-19 Shared Data Resource database captures all deaths, it does not capture hospitalizations and care received outside the VA. This may explain why having other non-VHA insurance was associated with lower rates of 30-day hospitalization, given that patients with non-VHA insurance may have sought care outside the VA. The VHA COVID-19 Shared Data Resource database also does not establish whether SARS-CoV-2/COVID-19 is the reason for hospitalization, escalation of care, or death. Determining this is challenging. Our modeling also does not include laboratory or imaging data; these data have been shown to have robust predictive value post index date [32][33][34][35]. Finally, the model results are most relevant to Omicron variants and sub-variants and may not be relevant to other pathogenetic SARS-CoV-2 variants.

Conclusions
Our ML risk prediction modeling approach provides robust discrimination in predicting hospitalization, escalated hospital care, and death within 30 days of testing positive for SARS-CoV-2 infection during a recent observation period in which Omicron variants were the major cause of COVID-19. It can inform health care system vaccination and resource allocation decisions by characterizing individuals and subpopulations at low-to-high risk for 30-day hospitalization, escalated hospital care, or death, and identifying those who might benefit least-to-most from preventive intervention. While this modeling was developed specifically for the Omicron variant surge, analogous modeling can be developed and implemented rapidly in real time to guide vaccination strategies and resource allocation during future COVID-19 surges.

Fig 1. UpSet plot of non-exclusive 30-day outcomes of interest in US Veterans. A dot in each row represents patients experiencing that outcome at any time within 30 days after testing positive. The vertical line connecting two (or more) dots represents patients who experienced two or more of the outcomes at any time within 30 days after testing positive. https://doi.org/10.1371/journal.pone.0290221.g001

Fig 2. Classification performance curves with respective area under curve (AUC) and 95% confidence intervals. (A) Receiver Operating Characteristic (ROC) curve for each model with respective false positive and true positive rates at the classification thresholds indicated by black dots. (B) Normalized precision recall curve for each 30-day outcome. https://doi.org/10.1371/journal.pone.0290221.g002
Fig 3). We generated SHAP summary plots to show the impact of covariate values on predictive output (S1 Fig). Advanced age was the second most predictive covariate for hospitalization (Fig 3A and S1A Fig). It was also the most predictive covariate for escalation of care (Fig 3B and S1B Fig) and mortality, accounting for more than 50% of relative importance (Fig 3C and S1C Fig). Weighted indices of comorbid illnesses, the Charlson Comorbidity Index (CCI) [14] and Elixhauser index [15], were more robust predictors of the adverse outcomes than individual cardiometabolic, renal, and respiratory conditions (Fig 3). BMI was highly predictive of the outcomes; BMI was inversely proportional to predicted risk, based upon SHAP analysis (Fig 3 and S1 Fig). Veterans taking an oral anticoagulant at any time in the two years prior to testing positive for SARS-CoV-2 had higher risks of hospitalization and need for escalation of care (Fig 3A, 3B and S1A, S1B Fig). Patients who had been prescribed vasopressors at any time in the prior two years had a higher predicted risk for escalation of care, while patients on the diuretic furosemide had a higher predicted risk for mortality (Fig 3B, 3C and S1B, S1C Fig).

Fig 3. Clinical feature importance plot. (A) hospitalization, (B) escalation of care, and (C) mortality. Feature importance values for each of the three outcomes of interest are presented as a percentage, which is indicative of the fraction of error reduction that a given feature contributed to the model. https://doi.org/10.1371/journal.pone.0290221.g003